Forecast scoring has had a history of frustration, as well as a measure of
success, with little universal agreement on how it should be achieved.
Consequently, interest in the subject has increased and decreased intermittently
over the years. In April 1979, a special panel of the Committee on Atmospheric
Sciences of the National Research Council (NRC) wrote a report on "Long-Range
Weather Forecast Evaluation," which contained a recommended scoring system
for the forecasts of weekly average temperature at several locations in the United
States, forecast one month in advance. The NRC report became a starting point
for a renewed effort on scoring methods, both at United States Air Force
Environmental Technical Applications Center (USAFETAC) and Air Force Geophysics
Laboratory (AFGL). The authors of this report, representing the last two
organizations, offer a procedure that shows promise on extended forecasting. The
prior interest in the subject by Col Gary Atkinson, AWS, and the several earlier
investigations by Lt Col Gerald J. Dittberner and others at ETAC are acknowledged
both for their inspiration and their guidance.

The B-G System of Evaluating Forecasts


1.  INTRODUCTION 

The effectiveness of weather forecasting, 12 hours to several days in advance,
is well established. But how much longer in advance are forecasts worthwhile?
How shall we decide which of several competing predictions deserves the highest
rating? Indeed, how should forecasts be judged? By accuracy, by usefulness, or
by something else that we might call the skill or expertise of the forecaster?
Should forecasters be allowed to hedge, by stating probabilities or otherwise?
What constitutes a verification? How are errors in the forecast measured?

The whole subject of scoring and evaluating forecasts is riddled with problems,
complications, doubts, and skepticism. Usually efforts to devise a scheme
of scoring, objective or otherwise, have suffered rejection, sometimes by the
most professional of meteorologists. At the same time the need for such
evaluation has been inescapable and will not fade away. [1] Moreover, there has
been no dearth of schemes, new [2] or old. [3] In particular, an acceptable
objective system has been sought for long-range forecasting in order to answer
the question about the very existence of effective expertise in the forecasts
(see Appendix A).

(Received for publication 30 December 1981)

1. Nap, J. L., Van den Dool, H. M., Oerlemans, J. (1981) A verification of
   monthly weather forecasts in the seventies, Monthly Weather Review 109:306-312.

2. Gulezian, Dean P. (1981) A new verification score for public forecasts,
   Monthly Weather Review 109:313-323.

3. Gringorten, Irving I. (1965) A measure of skill in forecasting a continuous
   variable, J. Appl. Meteorol. 4:47-53.

Forecasters  might  be  rated  for  their  accuracy,  usefulness  or  professional 
skill.  The  degree  of  accuracy,  by  and  large,  is  secondary  to  the  utility  of  the 
forecasts. But a measure of utility, in turn, is elusive. It would require a
knowledge of customers' costs and/or profits, which are much too changeable, to
say nothing about differing and conflicting interests. We are left with the goal
to  determine,  or  uncover,  the  professional  skill  in  a  set  of  forecasts.  The  B-G 
system  presented  here  is  limited  to  this  specific  objective. 

Basically,  climatological  information  such  as  the  climatic  frequency  of  cloud 
cover  from  clear  to  overcast  should  be  available  to  a  forecaster.  In  the  case  of 
temperature,  say  at  Minneapolis,  noontime  in  January,  the  climatic  information 
will  include  the  frequency  distribution  of  the  temperatures,  ranging  in  this  case 
from -36°C with 1/100 of 1 percent probability through the median of -11°C up to
10°C with 1/100 of 1 percent probability of exceedance.

For  the  purpose  of  this  paper,  the  skill  of  the  forecaster  is  defined  as  his 
ability  to  recognize  and  to  quantify  the  probability  of  departure  of  a  future  event 
from  the  normal  climatic  frequency  of  the  event.  If,  in  the  case  of  Minneapolis 
temperature,  he  sees  a  strong  probability  of  the  later  temperature  -15°C,  and  so 
forecasts,  then  a  subsequent  verification  of  -15°C  should  earn  him  a  maximum 
positive  score,  and  somewhat  less  if  subsequently  the  temperature  is  -16°C  or 
-14°C. 

In  recent  years,  much  has  been  said,  as  well  as  done,  about  probability 
forecasting. This has necessitated another type of skill: the ability to state
valid  probabilities.  A  line  of  demarcation  should  be  drawn  between  this  skill  and 
the  ability  of  the  forecaster  to  sharpen  the  prediction  toward  a  higher  degree  of 
certainty  of  one  outcome.  If,  say,  there  is  a  30  percent  climatic  probability  of 
rain,  then  the  quotation  of  50  percent  probability  of  rain  is  to  be  considered  as  a 
sharpening  of  the  odds  on  rain.  There  could  be  some  kind  of  verification  of  the 
50  percent  quotation,  but  it  is  not  the  intended  goal  of  this  paper. 

This paper seeks to evaluate the total of forecast statements. Such evaluation
is especially desirable for long-range forecasts, seasonal or annual, since
the  verification  of  probability  statements  would  be  meaningless.  If  the  weather 
element  to  be  predicted  is  the  seasonal  winter  average  temperature,  predicted 
in  the  previous  fall  season,  the  forecaster  is  called  upon  to  give  his  best  single 
estimate  of  the  subsequent  winter  seasonal  average  temperature.  If  he  chooses 
-8°C  as  his  forecast  at  Minneapolis,  he  will  imply  a  warm  winter,  since  the 
median  winter  temperature  is  -11°C. 

For this scoring system, there is a mandatory premise that distinguishes it
from most, if not all other, scoring systems: The unskilled forecaster should
not benefit from any unskilled strategy when making his forecast. No quantity
should be forecast with greater expectation of a score than any other quantity
unless it be done for meteorological reasons, extended beyond the simple
knowledge of climatic frequencies. There must be no long-term advantage in an
unskilled choice. For example, if the unskilled forecaster predicts no-rain (NR),
simply because it is more frequent, he should earn the same average score, in
100 such forecasts, as he would if he predicted rain (R). Symbolically, if the
climatic frequencies are P(NR) and P(R) for no-rain and rain, respectively, then
the scores, S(NR) and S(R), earned by a correct forecast should satisfy

$$P(NR)\,S(NR) = P(R)\,S(R) = 1 \tag{1}$$

A skilled forecaster, on the other hand, must pursue the analysis of the synoptic
weather situation until he detects a trend or a better-than-usual probability of
one future event. Persistence, as a strategy, is examined in Section 5.
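The premise of Eq. (1) can be illustrated with a minimal sketch. It assumes, as Section 2 states for the two-alternative case, that an incorrect forecast scores zero; the function names and the 30 percent rain frequency are illustrative choices, not part of the original report.

```python
def premise_scores(p_rain):
    """Scores for a correct forecast chosen so that P(NR) S(NR) = P(R) S(R) = 1."""
    return {"R": 1.0 / p_rain, "NR": 1.0 / (1.0 - p_rain)}


def expected_score(always_forecast, p_rain):
    """Long-run mean score of a forecaster who always issues the same category.

    A correct forecast earns the premise score; an incorrect one earns zero.
    """
    climatic_frequency = {"R": p_rain, "NR": 1.0 - p_rain}
    scores = premise_scores(p_rain)
    return climatic_frequency[always_forecast] * scores[always_forecast]


p_rain = 0.3  # illustrative climatic frequency of rain
for category in ("R", "NR"):
    print(category, expected_score(category, p_rain))
```

Both constant strategies earn the same expected score, so there is no long-term advantage in simply forecasting the more frequent category.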

The  B-G  system,  presented  herein,  provides  a  score  for  each  individual 
forecast  and  an  evaluation  of  a  sufficiently  large  collection  of  forecasts. 


2.  SCORE 


If the weather element to be forecast consists of two alternatives, A and
not-A, with climatic frequencies p and (1-p), respectively, then in accordance
with the above premise, a correct forecast should earn the score s_A or s_not-A
given by

$$s_A = 1/p, \qquad s_{\text{not-}A} = 1/(1-p)$$

Suppose, on the other hand, that the weather element is a variable (X) whose
values range continuously from low to high values with cumulative probability
P(X<x) symbolized as p. Let the predicted event (F) have climatic cumulative
probability P_F, and the verifiably observed event (V) have climatic cumulative
probability P_V. If the variable (X) is divided dichotomously at x where the
climatic probability is p, then: for p less than both P_F and P_V, F is a correct
forecast of not-X, and the score is

$$s_p = 1/(1-p)$$

For p greater than one of (P_F or P_V), but less than the other, F is an incorrect
forecast and the score is

$$s_p = 0$$

For p greater than both P_F and P_V, F is a correct forecast of X, and the score
is

$$s_p = 1/p$$

For all dichotomous divisions the score average (s'), when P_V < P_F, is given by

$$s' = \int_0^{P_V} \frac{dp}{1-p} + \int_{P_F}^{1} \frac{dp}{p}$$

That is,

$$s' = -\ln\{(1 - P_V)\,P_F\} \quad \text{for } P_V < P_F$$

Similarly,

$$s' = -\ln\{(1 - P_F)\,P_V\} \quad \text{for } P_V \ge P_F$$

For an unskilled forecast the expected value of s' is 1.0. A score of zero,
however, is preferred for no skill. Therefore, the score (S_FV) for the forecast
(F) and verification (V) is chosen to be:

$$S_{FV} = \begin{cases} -\ln\{(1 - P_V)\,P_F\} - 1 & \text{for } P_V < P_F \\ -\ln\{(1 - P_F)\,P_V\} - 1 & \text{for } P_V \ge P_F \end{cases} \tag{3}$$
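The score of Eq. (3) can be written as a short function. This is a minimal sketch; the name `bg_score` and the argument check are illustrative additions, and the inputs are assumed to be cumulative climatic probabilities strictly between 0 and 1.

```python
import math


def bg_score(p_f, p_v):
    """Score S_FV of Eq. (3).

    p_f: climatic cumulative probability of the forecast event (P_F)
    p_v: climatic cumulative probability of the observed event (P_V)
    """
    if not (0.0 < p_f < 1.0 and 0.0 < p_v < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    if p_v < p_f:
        return -math.log((1.0 - p_v) * p_f) - 1.0
    return -math.log((1.0 - p_f) * p_v) - 1.0


# A median forecast that verifies exactly, nearly, and badly.
for p_v in (0.5, 0.7, 0.99):
    print(p_v, round(bg_score(0.5, p_v), 3))
```

For a fixed forecast the score peaks where p_v equals p_f and falls off on either side, which is the shape described for Figures 1 to 3.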


To illustrate the scores (S_FV), consider the following: In Figure 1 it is
assumed that the median has been forecast (that is, P_F = 0.5). The score is
maximum when the forecast is exactly correct; it is still positive when, for
verification, 0.27 ≤ P_V ≤ 0.73; it is negative otherwise and lowest when either
the lowest or highest extreme verifies. In Figure 2, it is assumed that the lower
quartile has been forecast (P_F = 0.25). The score again is maximum for an
exactly correct verification. In Figure 3, it is assumed that the forecast calls
for an extreme low condition to develop. If correct, or nearly correct, then the
score is large; otherwise the score drops rapidly to negative values with
increasing error. In all cases, however, the average, or expected, no-skill score
is zero.

Figure 1. The Score (S_FV) Plotted Against the Probability (P_V)
of the Observed Event (V) When the Forecast (F) is the Median
(P_F = 0.5)

Figure 2. The Score (S_FV) Plotted Against the Probability (P_V)
of the Observed Event (V) When the Forecast (F) is the Lower
Quartile (P_F = 0.25)

Figure 3. The Score (S_FV) Plotted Against the Probability (P_V) of
the Observed Event (V) When the Forecast (F) is the Five Percentile
(P_F = 0.05)

Figure 4 shows the scores (s), from the lowest possible to the perfect, versus
the forecast (represented by P_F). When the forecaster predicts the median
(P_F = 0.5), he will earn a positive score when the median event verifies, and lose
if some unusual or extreme event verifies, although his gain or loss will be modest.
On the other hand, if the forecaster predicts an unusual event, low or high, he can
earn an unusually high score (s), while risking a greater negative score. Whether
he forecasts the median or an extreme event, however, if he is only guessing, or
uses an unskilled system of forecasting, the expected score is 0.0. At all times
the perfect score is achieved when P_V = P_F. The average of the perfect scores
is 1.0.
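The two averages quoted above (an expected no-skill score of 0.0 and a mean perfect score of 1.0) can be checked numerically. The sketch below is only a verification aid: it assumes the verification of an unskilled forecast is uniformly distributed in cumulative probability, and it uses a simple midpoint rule; the helper names are illustrative.

```python
import math


def bg_score(p_f, p_v):
    """Score S_FV for forecast probability p_f and observed probability p_v."""
    if p_v < p_f:
        return -math.log((1.0 - p_v) * p_f) - 1.0
    return -math.log((1.0 - p_f) * p_v) - 1.0


def mean_on_unit_interval(func, n=100_000):
    """Midpoint-rule average of func over (0, 1).

    Midpoints avoid evaluating at 0 and 1, where the perfect score is unbounded.
    """
    h = 1.0 / n
    return sum(func((k + 0.5) * h) for k in range(n)) * h


def chance_mean(p_f):
    """Expected score of a fixed forecast p_f when the verification is pure chance."""
    return mean_on_unit_interval(lambda p_v: bg_score(p_f, p_v))


perfect_mean = mean_on_unit_interval(lambda p: bg_score(p, p))

for p_f in (0.05, 0.25, 0.5):
    print(f"P_F = {p_f}: mean chance score = {chance_mean(p_f):.4f}")
print(f"mean of perfect scores = {perfect_mean:.4f}")
```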

The foregoing scores [Eq. (3)] can be used when the predictand weather element
varies continuously, theoretically from minus infinity to infinity, with
cumulative probabilities (P_F, P_V) well defined. However, when we must begin with a
category, such as a substantial probability of no rain, or clear sky (0/10), or end
with a substantial probability of overcast (10/10), we must modify the scoring
system, while adhering to the premise that unskilled forecasting must not benefit
from a no-skill strategy (Appendix B).


Figure 4. The Range of Scores (s) From the Lowest Possible to the
Perfect Score, as a Function of the Forecast (P_F)


3.  EVALUATION 

In the preceding section, each pair of forecast and verification (P_F, P_V)
results in a score (S_FV). Evaluation of a whole forecast system, however, must
depend on a large set (N) of forecasts and verifications. Among N forecasts,
there will be high and low scores. Since the actual score depends on the forecast
itself, as well as on its excellence, it is desirable to examine the score of the
forecast by its place in the set of possible scores (Figures 1, 2, or 3). The
forecaster whose scores are consistently among the upper 10 percent is clearly
better than the forecaster who can claim only that his scores are among the upper
25 percent.


Figure 5 shows the scores for the whole range of verifications (0 ≤ P_V ≤ 1)
for the forecast (P_F = 0.35). For the verification (P_V2), a score (s_0) by pure
chance can exceed the earned score with frequency P_V2. If the verification
is closer to the forecast (P_F), then the score (S_FV) is not exceeded quite as
often by chance. For the same score (S_FV) there can be two verifications:

$$P_{Vl} = 1 - (1 - P_F)\,P_V / P_F \quad \text{if } P_V > P_F$$

$$P_{Vu} = (1 - P_V)\,P_F / (1 - P_F) \quad \text{if } P_V \le P_F$$

As seen from the diagram, the score S_FV can be exceeded by chance with
probability (P_V - P_Vl) or (P_Vu - P_V).


Figure 5. Illustrating the Distribution of Scores (S_FV) as a Function
of Observed Event (P_V), and the Likelihood (LCS) of a Chance
Score (s_0) to Exceed S_FV When the Forecast is P_F


In summary, we find the probability, P(s_0 ≥ S_FV), that a chance score (s_0)
can equal or exceed the achieved score (S_FV), as follows (Figure 5), defining LCS
as LCS = P(s_0 ≥ S_FV):

$$LCS = \begin{cases} P_V & \text{when } P_F < P_V,\ P_V \ge P_F/(1 - P_F) \\ (P_V - P_F)/P_F & \text{when } P_F < P_V,\ P_V < P_F/(1 - P_F) \\ (P_F - P_V)/(1 - P_F) & \text{when } P_F \ge P_V,\ P_V > 2 - 1/P_F \\ 1 - P_V & \text{when } P_F \ge P_V,\ P_V \le 2 - 1/P_F \end{cases} \tag{6}$$
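The four cases of Eq. (6) translate directly into code. The sketch below also includes a brute-force cross-check that counts, on a fine grid of chance verifications, how often the chance score equals or exceeds the earned score; the function names and the grid size are illustrative, and the cross-check assumes chance verifications are uniform in cumulative probability.

```python
import math


def bg_score(p_f, p_v):
    """Score S_FV of Eq. (3)."""
    if p_v < p_f:
        return -math.log((1.0 - p_v) * p_f) - 1.0
    return -math.log((1.0 - p_f) * p_v) - 1.0


def lcs(p_f, p_v):
    """Likelihood of a chance score, LCS = P(s_0 >= S_FV), following Eq. (6)."""
    if p_v > p_f:
        if p_v >= p_f / (1.0 - p_f):
            return p_v
        return (p_v - p_f) / p_f
    if p_v > 2.0 - 1.0 / p_f:
        return (p_f - p_v) / (1.0 - p_f)
    return 1.0 - p_v


def lcs_brute_force(p_f, p_v, n=100_000):
    """Fraction of equally likely chance verifications scoring at least S_FV."""
    earned = bg_score(p_f, p_v)
    hits = sum(bg_score(p_f, (k + 0.5) / n) >= earned for k in range(n))
    return hits / n


for p_f, p_v in [(0.35, 0.45), (0.35, 0.9), (0.8, 0.1), (0.8, 0.78)]:
    print(p_f, p_v, round(lcs(p_f, p_v), 4), round(lcs_brute_force(p_f, p_v), 4))
```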

The likelihood of a chance score (LCS) will be less for a more successful
forecast. For an accurate forecast it would be identically zero; for a "complete
bust" it would be 1.0. For N unskilled forecasts, we expect (N/10) forecasts to
achieve scores for which LCS ≤ 0.1, (2N/10) forecasts with scores for which
LCS ≤ 0.2, and so on. In general, we would expect (iN/10) unskilled forecasts to
achieve scores for which LCS ≤ i/10. In Figure 6 the number of forecasts (n) is
plotted against LCS (P). The diagonal straight line gives the expected numbers by
chance. For example, among 100 unskilled forecasts we expect 40 to have scores
high enough to give LCS ≤ 0.4. In a test (Figure 6) in which 100 random numbers
were used for the verification (P_V) against 100 random forecasts (P_F), the
number of scores in the upper 10 percent was 9 (instead of 10); in the upper
20 percent there were 17 (instead of 20), and so on, as shown by the solid curve
(Figure 6). The chi-square test (see Section 4) showed no significant differences
between the solid curve and the diagonal straight line. In two other tests the
forecasts were: (1) consistently the median; (2) consistently the lowest
1-percentile, with the results shown by the broken lines. Again the chi-square
test did not reveal a significant difference.
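A no-skill test of the kind described above can be sketched as follows. This is not a reproduction of the Figure 6 experiment: the random number generator, the seed, and the sample size (much larger than 100, to reduce sampling noise) are all assumptions made for illustration.

```python
import random


def lcs(p_f, p_v):
    """Likelihood of a chance score, following Eq. (6)."""
    if p_v > p_f:
        if p_v >= p_f / (1.0 - p_f):
            return p_v
        return (p_v - p_f) / p_f
    if p_v > 2.0 - 1.0 / p_f:
        return (p_f - p_v) / (1.0 - p_f)
    return 1.0 - p_v


def cumulative_fractions(pairs):
    """Fraction of forecasts with LCS <= i/10 for i = 1, ..., 9."""
    values = [lcs(p_f, p_v) for p_f, p_v in pairs]
    return [sum(v <= i / 10 for v in values) / len(values) for i in range(1, 10)]


def simulate(strategy, n=20_000, seed=0):
    """Pair an unskilled forecast strategy with purely random verifications."""
    rng = random.Random(seed)

    def draw():
        # Keep probabilities strictly inside (0, 1).
        return min(max(rng.random(), 1e-12), 1.0 - 1e-12)

    pairs = [(strategy(draw), draw()) for _ in range(n)]
    return cumulative_fractions(pairs)


strategies = {
    "random forecast": lambda draw: draw(),
    "always the median": lambda draw: 0.5,
    "always the 1-percentile": lambda draw: 0.01,
}
for name, strategy in strategies.items():
    print(name, [round(f, 2) for f in simulate(strategy)])
```

Under these assumptions each strategy should give fractions close to 0.1, 0.2, ..., 0.9, which corresponds to the diagonal line of Figure 6.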

In contrast with the tests illustrated in Figure 6, those in Figure 7 were
performed with 100 pairs of forecasts and verifications when there was a
significant correlation coefficient (ρ) between them, simulating skillful
forecasting. When ρ = 0.5, there were 18 forecasts, out of 100, whose scores were
among the upper 10 percent; when ρ = 0.95, there were 49 such successful
forecasts; when ρ = 0.99, there were as many as 71 forecasts that succeeded in
achieving scores in the rare 10 percent bracket.
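The effect of correlation can be explored with a small simulation. The report does not say how its correlated pairs were generated, so the sketch below makes its own assumption: forecast and observed deviates are drawn from a bivariate normal distribution with correlation rho and converted to cumulative probabilities with the normal distribution function. It is not expected to reproduce the counts quoted above exactly.

```python
import math
import random
from statistics import NormalDist


def lcs(p_f, p_v):
    """Likelihood of a chance score, following Eq. (6)."""
    if p_v > p_f:
        if p_v >= p_f / (1.0 - p_f):
            return p_v
        return (p_v - p_f) / p_f
    if p_v > 2.0 - 1.0 / p_f:
        return (p_f - p_v) / (1.0 - p_f)
    return 1.0 - p_v


def fraction_in_upper_decile(rho, n=20_000, seed=1):
    """Fraction of simulated forecasts whose LCS is 0.1 or less.

    Assumption: (forecast, observed) deviates are bivariate normal with
    correlation rho; this generator is hypothetical, not the report's.
    """
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    hits = 0
    for _ in range(n):
        z_f = rng.gauss(0.0, 1.0)
        z_v = rho * z_f + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        if lcs(cdf(z_f), cdf(z_v)) <= 0.1:
            hits += 1
    return hits / n


for rho in (0.0, 0.5, 0.95, 0.99):
    print(rho, round(fraction_in_upper_decile(rho), 3))
```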

An evaluation of the set of N forecasts is accomplished by finding the 9
numbers (n_P) for P = 0.1(0.1)0.9, knowing that they must exceed the numbers (NP)
expected by chance.

Alternatively, or as a supplementary evaluation, the average value of the
probability, LCS, will be useful especially in the comparison of two sets of
forecasts. A proposed evaluation (E) of the set, ranging from -1.0 for
worse-than-useless to 1.0 for perfect, is:

$$E = 1 - 2\,\overline{LCS} \tag{7}$$

Figure 6. The Number (n) of "Forecasts" (out of 100) Whose Scores
had Probability (LCS) of Being Equaled or Exceeded by Chance. All
"forecasts" were made by unskilled procedures


Zero for E would imply no skill. Consider the following example: For 75 years
(1903 to 1978), at the University of Wisconsin, forecasts of the average
temperature in the winter season (December, January, and February) were made for
some 42 stations on or before 30 November of each year. The results for Winnipeg,
Manitoba (Figure 8) show that the forecasts do demonstrate skill, or have
informational value but qualitatively, since there was not a significant number of
forecasts verifying at less than the 30-percentile. The results for Chicago,
Illinois (Figure 9) are very similar, except that, without the supporting evidence
of the Winnipeg performance, these results would have been viewed as indecisive,
requiring further sampling to establish conclusively the forecasting skill (see
Section 4).
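The evaluation E of Eq. (7) is a one-line computation once the LCS values of a set of forecasts are in hand. A minimal sketch follows; the function name and the sample values are illustrative only.

```python
def evaluation(lcs_values):
    """Evaluation E = 1 - 2 * mean(LCS) for a set of forecasts, as in Eq. (7)."""
    if not lcs_values:
        raise ValueError("at least one LCS value is required")
    return 1.0 - 2.0 * sum(lcs_values) / len(lcs_values)


print(evaluation([0.0, 0.0, 0.0]))   # every forecast exact
print(evaluation([0.2, 0.5, 0.8]))   # mean LCS of 0.5, as expected by chance
print(evaluation([1.0, 1.0, 1.0]))   # every forecast a complete bust
```

Since the mean LCS of unskilled forecasts is expected to be 0.5, E is near zero for no skill, approaches 1.0 as forecasts become exact, and approaches -1.0 for forecasts that are consistently worse than chance.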




Figure 7. The Number (n) of "Forecasts" (out of 100) Whose Scores
had Probability (LCS) of Being Equaled or Exceeded by Chance. The
correlation coefficients (ρ) between "forecasts" and "observed" events
were as shown


Again the problem of some categories of weather, such as no-rain, must be
faced. Modifications of Eq. (6) were obtained on the assumption that there is a
category (X) with climatic frequency (P_Xu - P_Xl), with much attention to details
(Appendix B).


Figure 8. The Number (n) of Forecasts [out of 75 Forecasts (1903-1978)] Whose
Scores  had  Probability  (LCS)  of  Being  Equaled  or  Exceeded  by  Chance.  The 
example  is  for  the  average  winter  temperature  at  Winnipeg,  Manitoba 


Figure  9.  The  Number  (n)  of  Forecasts  [out  of  75  Forecasts  (1903-1978)]  Whose 
Scores  had  Probability  (LCS)  of  Being  Equaled  or  Exceeded  by  Chance.  The 
example  is  for  the  average  winter  temperature  at  Chicago,  Illinois 
4. CHI-SQUARE TEST OF SIGNIFICANCE


Where N is the overall number of forecasts, and n_i is the number of forecasts
(i = 0(1)9) whose values of LCS lie between i/10 and (i + 1)/10, then chi-square
with 9 degrees of freedom is

$$\chi_9^2 = \sum_{i=0}^{9} \frac{(N/10 - n_i)^2}{N/10}$$

If the forecasts are divided dichotomously at $\sum_{j=0}^{i} n_j$, then

$${}_i\chi_1^2 = \frac{\left(N P_i - \sum_{j=0}^{i} n_j\right)^2}{N P_i\,(1 - P_i)}$$

where

$$P_i = 0.1\,(1 + i), \quad i = 0(1)8$$

The 5 percent values of chi-square, as given in textbook tables, are

$$\chi_9^2(0.05) = 16.9$$

$$\chi_1^2(0.05) = 3.8$$

For the numbers (n_i) obtained in the no-skill tests (Figure 6) the chi-square
values were obtained as shown (Table 1). For the numbers (n_i) obtained when
there was a significant correlation (ρ) between forecast and observed events
(Figure 7) chi-square (Table 2) increased, as expected, with correlation
coefficient. The significance of chi-square also improved with sample size (Table 3).
Clearly, for smaller ρ it will take larger samples to establish significant skill.

The foregoing test is one of several alternatives, and its choice here is an
arbitrary one. Other tests apply to S_FV as opposed to LCS. Future work might
suggest another choice for test of significance.
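The two chi-square statistics of this section can be computed from the ten decile counts n_0, ..., n_9 of LCS. The sketch below is illustrative; the function names and the sample counts are assumptions, and the 5 percent critical values are those quoted in the text.

```python
def chi_square_9(counts):
    """Chi-square with 9 degrees of freedom from the ten decile counts n_i."""
    if len(counts) != 10:
        raise ValueError("expected ten counts, one per LCS decile")
    expected = sum(counts) / 10.0
    return sum((expected - n_i) ** 2 for n_i in counts) / expected


def chi_square_1(counts, i):
    """Chi-square with 1 degree of freedom for the split at LCS = 0.1 * (i + 1)."""
    if len(counts) != 10 or not 0 <= i <= 8:
        raise ValueError("expected ten counts and an index i from 0 to 8")
    total = sum(counts)
    p_i = 0.1 * (i + 1)
    cumulative = sum(counts[: i + 1])
    return (total * p_i - cumulative) ** 2 / (total * p_i * (1.0 - p_i))


CRITICAL_9 = 16.9  # 5 percent value, 9 degrees of freedom
CRITICAL_1 = 3.8   # 5 percent value, 1 degree of freedom

# Hypothetical counts for 100 forecasts, concentrated at low LCS.
counts = [30, 20, 15, 10, 8, 6, 5, 3, 2, 1]
print(round(chi_square_9(counts), 1), chi_square_9(counts) > CRITICAL_9)
print([round(chi_square_1(counts, i), 1) for i in range(9)])
```

As a consistency check on the reconstructed formula, 100 forecasts that all have LCS below 0.5 give a one-degree-of-freedom value of 100.0 at i = 4 and about 11.1 at i = 8, which agrees with the corresponding entries of the ρ = 0.99 row of Table 2.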


Table 1. Chi-Square Values With One (χ²_1) and Nine (χ²_9) Degrees of Freedom
in Tests of Three Kinds of Unskilled Forecast Selections. Each sample size was
100. For comparison, under the no-skill null hypothesis, the 5 percent
chi-square with one degree of freedom is 3.8; with 9 degrees of freedom it is 16.9


1  percentile 


Table 2. Chi-Square Values With One and Nine Degrees of Freedom in Tests in
"Forecasts" Whose Correlation Coefficient With the "Observed" Events Varied
from 0.05 to 0.99. Each sample size was 100. The broken line divides
significant chi-square from nonsignificant

Correlation                 χ²_1 for dichotomous division i
between
F and V     χ²_9       0      1      2      3      4      5      6      7      8

ρ = 0.05     4.8     0.0    0.2    0.2    0.2    0.0    0.0    0.8    0.6    0.4
    0.25    10.6     1.0    2.2    1.7    0.4    1.4    5.0    3.9    0.6    0.4
    0.5     24.2     9.0   16.1    6.9    8.2   11.6   12.0    8.0    6.2    5.4
    0.95   218.4   169.0  156.0  149.0  117.0   96.0   66.6   42.9   25.0   11.1
    0.99   449.0   413.0  315.0  220.0  150.0  100.0   66.7   42.9   25.0   11.1


Table 3. Test of Chi-Square Values With One and Nine Degrees of Freedom in
Tests of "Forecasts" Whose Correlation Coefficient With "Observed" Events is
0.25. The sample size was varied from 50 to 500. The broken line divides
significant chi-square from nonsignificant at the 5 percent level

                            χ²_1 for dichotomous division i
Sample
Size        χ²_9       0      1      2      3      4      5      6      7      8

  50         7.2     0.9    0.5    2.4    0.8    2.0    3.0    2.4    0.5    0.9
 100        10.6     1.0    2.2    1.7    0.4    1.4    5.0    3.9    0.6    0.4
 150        16.4     6.0   10.7    7.1    2.8    2.7    6.2    3.8    1.0    0.3
 200        17.4     6.7   10.1    8.6    3.5    3.4    6.0    4.7    0.5    0.9
 250        17.9     7.5    9.0    6.9    4.8    5.2    8.1    6.2    0.6    1.6
 300        18.1     7.3  [illegible]  9.1    5.6    4.8    6.7    6.3    1.0    1.8
 400        28.0     7.1   17.0   14.6   11.3    7.8   10.7   13.0    2.6    3.4
 500        24.0     7.2   17.1   17.6   18.8    6.7    8.0    8.6    3.2    3.8


5.  DISCUSSION 

The  scoring  system  of  this  paper  is  developed  for  the  continuous  variable 
and  for  a  forecast  stated  specifically,  not  for  a  probability  forecast.  Errors  in 
forecasting  are  measured  by  the  likelihood,  by  nonskilled  methods  (LCS),  of  the 
achieved  score.  If  the  forecaster  habitually  earns  scores  in  the  upper  50  percent 
then  we  rate  him  as  skillful  (Appendix  C).  But  we  proceed  further,  and  examine 
his consistency in earning scores exceeding the upper 40 percent, 30 percent,
20 percent or 10 percent. Scores for nearly perfect forecasting will exceed even
the  upper  1  percent.  If  the  forecaster  consistently  earns  scores  in  the  upper 
10  percent  we  must  rate  him  exceptionally  good. 

Other  scoring  programs  gather  forecasts  and  verifications  in  categories 
such  as  the  following:  below  normal,  normal,  and  above  normal,  which  have 
equal  climatic  frequencies.  They  score  1.0  for  the  correctly  forecast  category 
and  zero  for  an  incorrect  category,  and  base  evaluation  on  the  number  of  correct 
forecasts.  Such  evaluation  would  be  in  accord  with  the  premise  as  stated,  but 
surely we must agonize over the tantalizing feature that a "below normal"
prediction would be counted as an error when the observed verification is "low
normal." If the weather should verify extremely low, a forecast of "above
normal" would be badly in error, but a score of 0.0 does not show the extent of
the error. Conversely, there is never a high reward for a spectacularly good
forecast, only a score of 1.0.

Significant persistence in the weather could make the simple unskilled
strategy, that is, forecasting the present weather to prevail into the future,
result in profitable scores (S_FV). To avoid rewarding this nonskilled strategy,
the forecaster's performance can be judged against the performance of persistence
treated as the standard. The advantage in using simple climatology or other
unskilled strategy has already been eliminated. Chi-square can be found to test
the significance of the difference between any curves (Figures 6, 7, 8, or 9).

The  quantity  E  could  be  modified: 


E  =  1  -  P(sq  —  spv^/^So  “  spV^ 

Utility of the forecasts is not measured directly by this system. It is
reasonable, however, to view all operations in a given climate as requiring
adjustment to that climate. As a simple example, consider the selection of the
clothes that one may wear. Since the skillful forecaster does provide the sign and
extent of the departure of the weather from the climatic norm, our measure of that
skill becomes, at least indirectly, a measure of the utility of the forecast.

Since the system of evaluation depends heavily upon climatic frequency
distributions, prior knowledge of the probabilities (P_F, P_V) of predicted and
observed events (F, V) should be accurate. If the climatological information is
faulty, then clearly the errors in P_F and P_V would cause errors in the evaluation
[Eq. (6)]. However, as long as such errors are less than 10 percent, the errors in
LCS should also be less than 10 percent, which should make the errors in the counts
(n_i, i = 0, ..., 9) negligible for our purpose.

6.  OTHER  KEY  POINTS 

While there is a formula for scoring of forecasts [Eq. (3)], the scores need not
be calculated for the primary goal of evaluating a set of forecasts. The primary
statistic has become LCS = P(s_0 ≥ S_FV), given by Eq. (6). The evaluation (E)
depends upon the average value of LCS. The test of significance is done with our
old friend, in fact everybody's old friend, chi-square, on the number of
forecasts that succeed in pinpointing the future events.
Appendix A

Comments on "Long-Range Weather Forecast Evaluation"
By a Special Panel of The National Research Council,
4 April 1979


A National Research Council (NRC) special panel of the Committee on
Atmospheric Sciences has written a report, dated 4 April 1979, on "Long Range
Weather Forecast Evaluation," which we (AFGL/LYT) have examined, at the request
of the Air Weather Service. The NRC report contains a suggested scoring system for
the forecasts of weekly average temperature at several locations in the United
States, forecast one month in advance.

The method of scoring is directed at uncovering significant skill in 182
forecasts (F) of weekly average temperature (Y) of 26 alternate weeks of one year
at seven widely scattered stations in the United States. To be considered skillful,
the forecasts must improve on persistence (G), the weekly average temperature
obtained at the time the forecast (F) is due.

In the procedure the three variables (Y, F, G) are standardized, or normalized,
into (y, f, g), which should make them all have zero mean (0.0) and unit standard
deviation (1.0). However, the forecaster's values (F) may be deliberately biased,
and all values (Y, G) are to be found by sampling. The paper, therefore, allows
for nonzero means (ȳ, f̄, ḡ) and nonunity variances symbolized as [yy], [ff], [gg].
The covariances are [gy], [gf], and [yf].

The primary statistics for evaluation are the squares of the correlation
coefficients (cc). If ρ is the cc between y and g, and ρ_y.fg is the multiple cc
of y on f and g, then the skill score (S) is given by the difference:

$$S = \rho_{y.fg}^2 - \rho^2 \tag{A1}$$

Surprisingly, the NRC paper avoids the term correlation and calls S the
"incremental fractional reduction in forecast error variance," given by

$$S = \frac{\left([yg]\,[fg] - [yf]\,[gg]\right)^2}{[gg]\,[yy]\,\{[gg]\,[ff] - [gf]^2\}} \tag{A2}$$


This number is always positive, skillful forecasts or otherwise. In its purest
form, if [yy] = [ff] = [gg] = 1 and if persistence is useless, so that
[yg] = [fg] = 0, then

$$S = [yf]^2 \tag{A3}$$

which reveals that the score, or statistic for evaluation, is basically the sum
total of the products of each pair of forecast (f) and verified event (y). Each term
(yf) of 182 such terms contributes to the measure of the skill (S) and therefore is
effectively the score for that single forecast. It becomes useful, therefore, to
examine a table of such "scores" for the individual forecasts (Table A1).


Table A1. The NRC Scores (f is the standardized deviate of the forecast; y is the
standardized deviate of the verified event)
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After  the  collection  of  one  year's  data  and  <  ah  ulation  ol  the  number  S,  t.\u 
questions  are  raised:  Is  S  signify  antly  large  ?  <  an  m  .  omaude  that  the  extended 
forecasts  an-  truly  skillful?  The  NKC  panel  proposes  to  find  the  probability  that 
S  will  be  equaled  or  exceeded  in  a  population  of  values  (S  )  for  nonskilled  fore¬ 
casts.  To  find  one  value  for  S  ,  the  182  values  ol  v  and  the  182  values  of  a  art 

o  -  ^ 

entered  together  with  182  randomly  selected  values  foi  f  in  iq.  (A 2).  Hepeated 
Uh  duo  times  this  exem  ist-  produces  a  probability  distribution  of  values  for  S  . 

If  S  is  large  enough  to  lie  within  the  upper  a  percent  of  the  S  -values,  then  the 
forecasts  might  be  considered  significantly  skillful 
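The resampling test just described can be sketched in a few lines. This is a minimal illustration, not the NRC panel's program. It assumes that [ab] denotes the sum of products of paired values, and that the "randomly selected" forecasts are drawn from a standard normal distribution (the text does not name the distribution). The function names, the trial count, and the seed are hypothetical.

```python
import random

def bracket(a, b):
    """[ab]: sum of products of paired values (assumed meaning of the brackets)."""
    return sum(x * y for x, y in zip(a, b))

def skill_score(y, f, g):
    """S of Eq. (A2): y = verification, f = forecast, g = persistence."""
    yg, fg, yf = bracket(y, g), bracket(f, g), bracket(y, f)
    yy, ff, gg = bracket(y, y), bracket(f, f), bracket(g, g)
    return (yg * fg - yf * gg) ** 2 / (gg * yy * (gg * ff - fg ** 2))

def chance_exceedance(y, f, g, trials=1000, seed=0):
    """Return S and the fraction of random-forecast scores S_o with S_o >= S."""
    rng = random.Random(seed)
    s = skill_score(y, f, g)
    hits = 0
    for _ in range(trials):
        # Assumption: nonskilled forecasts are independent standard normal deviates.
        f_random = [rng.gauss(0.0, 1.0) for _ in y]
        if skill_score(y, f_random, g) >= s:
            hits += 1
    return s, hits / trials
```

If the returned fraction is no larger than the chosen significance level, S lies in the upper tail of the $S_o$ distribution. Because Eq. (A2) is a ratio of like powers of the brackets, it makes no difference whether the brackets are sums or averages.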

A criticism of the NRC approach is that it can be played. No device has been
deliberately incorporated into the system to eliminate the advantages of an
unskilled strategy, by hedging or otherwise. Let us examine the "scores" in
Table A1.

Faced with the need to make a forecast for the following month, the forecaster
can examine the potential rewards and penalties in Table A1. If he is completely
uncertain, he might be tempted to forecast the median temperature (f = 0), since
he will not risk punishment, whatever extreme develops. If he leans to colder
weather, he ought not to lean too far, because by choosing "moderate cold"
(f = -1) he will gain points substantially if it becomes very cold without his having
to forecast "very cold." His reward is not greatest for an accurate forecast.
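The hedging argument can be checked numerically. The sketch below assumes, following Eq. (A3), that each entry of Table A1 is simply the product yf. It takes the levels -3 to 3 from Table A2, which covers the same forecasts and verifications. The function names are hypothetical.

```python
LEVELS = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

def nrc_term(f, y):
    """Single-forecast contribution y*f under the simplifications behind Eq. (A3)."""
    return y * f

def best_forecast(y):
    """Forecast level that earns the largest product score for a verified y."""
    return max(LEVELS, key=lambda f: nrc_term(f, y))

# Rows are forecasts f, columns are verifications y.
table = {f: {y: nrc_term(f, y) for y in LEVELS} for f in LEVELS}
```

Under this assumption a forecast of the median (f = 0) scores zero whatever happens. When moderate cold verifies (y = -1), the accurate forecast f = -1 earns 1 point while the overstated forecast f = -3 earns 3, so the product score does not peak at the accurate forecast.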

In  contrast,  the  scores,  by  the  method  of  this  paper,  are  presented  (Table  A2) 
for  the  same  forecasts  and  verifications  as  in  Table  Al. 


Table A2. The Gringorten Scores (f is the standardized deviate of the forecast;
y is the standardized deviate of the verified event)

 f \ y      -3.0      -2.0      -1.0       0        1.0       2.0       3.0
 -3.0       5.61      2.78      0.84     -0.31     -0.83     -0.98     -0.9973
 -2.0       2.78      2.81      0.86     -0.28     -0.80     -0.95     -0.98
 -1.0       0.84      0.86      1.01     -0.13     -0.65     -0.80     -0.83
  0        -0.31     -0.28     -0.13      0.39     -0.13     -0.28     -0.31
  1.0      -0.83     -0.80     -0.65     -0.13      1.01      0.86      0.84
  2.0      -0.98     -0.95     -0.80     -0.28      0.86      2.81      2.78
  3.0      -0.9973   -0.98     -0.83     -0.31      0.84      2.78      5.61
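The entries of Table A2 can be reproduced, to the printed precision, by a score of the form $-\ln\{(1 - P_{lo})P_{hi}\} - 1$, where $P_{lo}$ and $P_{hi}$ are the smaller and larger of the cumulative probabilities of forecast and verification. This form is inferred from the tabulated values and from the bracketed terms of Appendix B; it is not quoted from the text. The sketch also assumes that f and y are normal deviates. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative probability of a standardized normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bg_score(p_f, p_v):
    """Score for forecast probability p_f and verified probability p_v (inferred form)."""
    lo, hi = min(p_f, p_v), max(p_f, p_v)
    return -math.log((1.0 - lo) * hi) - 1.0

def score_from_deviates(f, y):
    """Same score, entered with standardized deviates as in Table A2."""
    return bg_score(normal_cdf(f), normal_cdf(y))

if __name__ == "__main__":
    levels = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
    for f in levels:
        print(" ".join(f"{score_from_deviates(f, y):8.4f}" for y in levels))
```

In each column the largest score falls on the diagonal, so under this form the accurate forecast is the one most rewarded, in contrast to the product scores of Table A1.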
Appendix B

Scoring and Evaluation With a Category
(For Example, No Measurable Rain)


Figure B1 is the same as Figure 2 of the text except that an interval in the
distribution between $P_{xl}$ and $P_{xu}$ corresponds to a class of weather (X) whose
climatic probability is ($P_{xu} - P_{xl}$). There should be one score for this category
(${}_x s_{FV}$), although not necessarily the arithmetic average. It is such that, overall.

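For $P_{xl} < P_V \le P_{xu}$ and for $P_F > P_V$,

$${}_x s_{FV} = \left[-\ln\{(1 - P_{xu})P_F\} - 1\right] + 1 + \frac{1 - P_{xl}}{P_{xu} - P_{xl}} \ln\frac{1 - P_{xu}}{1 - P_{xl}} \qquad \text{(B1)}$$

which corresponds to the probability ($P_{V'}$) given by

$$P_{V'} = 1 - e^{-(1 + {}_x s_{FV})}/P_F \qquad \text{(B2)}$$

For $P_F < P_V$,

$${}_x s_{FV} = \left[-\ln\{(1 - P_F)P_{xu}\} - 1\right] + 1 - \frac{P_{xl}}{P_{xu} - P_{xl}} \ln\frac{P_{xu}}{P_{xl}} \qquad \text{(B3)}$$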
Figure B1. Showing the Distribution of Scores When There Is a
Category of Events from $P_{xl}$ to $P_{xu}$ (horizontal axis: $P_V$)


which corresponds to the probability given by

$$P_{V'} = e^{-(1 + {}_x s_{FV})}/(1 - P_F) \qquad \text{(B4)}$$


In the case of no-rain, with frequency $P_{NR} = P_{xu}$, $P_{xl} = 0$, the score for a
forecast of a certain amount of rain, when in fact it does not rain, should be

$${}_{NR} s_{FV} = \left[-\ln\{(1 - P_{NR})P_F\} - 1\right] + 1 + \frac{\ln(1 - P_{NR})}{P_{NR}} \qquad \text{(B5)}$$
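As a numerical check on the category scores, the sketch below compares closed-form expressions with a direct average of the continuous score over the category interval. It rests on two assumptions. The first is that the continuous score has the form $-\ln\{(1 - P_{lo})P_{hi}\} - 1$. The second is that the category score is the mean of that score for verifications spread uniformly between $P_{xl}$ and $P_{xu}$, which is what the structure of Eqs. (B1), (B3) and (B5) suggests. The closed forms are written as read from those equations, and the function names are hypothetical.

```python
import math

def bg_score(p_f, p_v):
    """Continuous score (assumed form) for forecast p_f and verification p_v."""
    lo, hi = min(p_f, p_v), max(p_f, p_v)
    return -math.log((1.0 - lo) * hi) - 1.0

def category_score_above(p_f, p_xl, p_xu):
    """Category score when the forecast lies above the category (p_f >= p_xu)."""
    head = -math.log((1.0 - p_xu) * p_f) - 1.0
    return head + 1.0 + (1.0 - p_xl) / (p_xu - p_xl) * math.log((1.0 - p_xu) / (1.0 - p_xl))

def category_score_below(p_f, p_xl, p_xu):
    """Category score when the forecast lies below the category (p_f <= p_xl, p_xl > 0)."""
    head = -math.log((1.0 - p_f) * p_xu) - 1.0
    return head + 1.0 - p_xl / (p_xu - p_xl) * math.log(p_xu / p_xl)

def mean_score(p_f, p_xl, p_xu, n=20000):
    """Midpoint-rule average of the continuous score over the category interval."""
    h = (p_xu - p_xl) / n
    return sum(bg_score(p_f, p_xl + (k + 0.5) * h) for k in range(n)) / n
```

With $P_{xl} = 0$ and $P_{xu} = P_{NR}$ the first closed form reduces to the no-rain case. Agreement between the closed forms and `mean_score` only confirms internal consistency under the stated assumptions; it does not establish the intended definition of the category score.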


Up to, and including, the categorized event, the cumulative probability should
be $P_{xu}$. With this in mind, the likelihood, LCS $= P(s_o \ge s_{FV})$, that a chance
score ($s_o$) can equal or exceed the achieved score ($s_{FV}$), follows. First, a few
more terms need to be defined, in addition to those terms already given:

$$P_{V'u} = (1 - P_{V'})P_F/(1 - P_F) \qquad \text{(B6)}$$

$$P_{V'l} = 1 - (1 - P_F)P_{V'}/P_F \qquad \text{(B7)}$$

If $P_F < P_V$:
    If $P_V \ge P_F/(1 - P_F)$, then LCS $= P_V$
    If $P_V < P_F/(1 - P_F)$:
        If $P_V \le P_{xl}$, then LCS $= (P_V - P_F)/P_F$
        If $P_V > P_{xl}$:
            If $P_V \le P_{xu}$ (actually $P_V = P_{xu}$), then LCS $= P_{xu} - P_{V'l}$
            If $P_V > P_{xu}$:
                If $P_{xu} < P_{Vl}$, then LCS $= P_V - P_{Vl}$
                If $P_{xu} \ge P_{Vl}$:
                    If $P_{Vl} > P_{xl}$:
                        If $s_{FV} \ge {}_x s_{FV}$, then LCS $= P_V - P_{xu}$
                        If $s_{FV} < {}_x s_{FV}$, then LCS $= P_V - P_{xl}$
                    If $P_{Vl} < P_{xl}$, then LCS $= P_V - P_{Vl}$

If $P_F \ge P_V$:
    If $P_V \ge 2 - 1/P_F$:
        If $P_V \le P_{xl}$:
            If $P_{xl} \ge P_{Vu}$, then LCS $= P_{Vu} - P_V$
            If $P_{xl} < P_{Vu}$:
                If $P_{Vu} \le P_{xu}$:
                    If $s_{FV} > {}_x s_{FV}$, then LCS $= P_{xl} - P_V$
                    If $s_{FV} \le {}_x s_{FV}$, then LCS $= P_{xu} - P_V$
                If $P_{Vu} > P_{xu}$, then LCS $= P_{Vu} - P_V$
        If $P_V > P_{xl}$:
            If $P_V \le P_{xu}$ (actually $P_V = P_{xu}$), then LCS $= P_{V'u} - P_{xl}$
                (Note: If $P_{V'u}$ computes greater than 1.0, set it to 1.0)
            If $P_V > P_{xu}$, then LCS $= (P_F - P_V)/(1 - P_F)$


Appendix C

Program Steps in The B-G System of Evaluating Forecasts
of Continuous Weather Elements and Chi-Square Tests
of Significance


Step 1. Initialize: $N = 0$
        $n_i = 0$, $i = 0(1)9$
        $\sum_j (\mathrm{LCS})_j = \sum P(S_o \ge S_{FV}) = 0$
        Set $N_s$: sample size

Step 2. Enter the (next) forecast and verification, respectively, as $P_F$, $P_V$

Step 3. Find LCS $= P(S_o \ge S_{FV})$
        $= P_V$, if $P_F < P_V$, $P_V \ge P_F/(1 - P_F)$
        $= (P_V - P_F)/P_F$, if $P_F < P_V$, $P_V < P_F/(1 - P_F)$
        $= (P_F - P_V)/(1 - P_F)$, if $P_F \ge P_V$, $P_V \ge 2 - 1/P_F$
        $= 1 - P_V$, if $P_F \ge P_V$, $P_V < 2 - 1/P_F$
Step 4. Add $P(S_o \ge S_{FV})$ to $\sum_{j=1}^{N} \{P(S_o \ge S_{FV})\}_j$
        Add 1 to $N$
        Find $i = \mathrm{int}\{10 \cdot P(S_o \ge S_{FV})\}$
        Add 1 to $n_i$

Step 5. If $N \ge N_s$, go to Step 6.
        If $N < N_s$, go to Step 2.

Step 6. Find $\overline{P(S_o \ge S_{FV})} = \sum_{j=1}^{N} \{P(S_o \ge S_{FV})\}_j / N$
        Find $E = 1 - 2 \cdot \overline{P(S_o \ge S_{FV})}$

Step 7. Find $\chi_9^2 = \sum_{i=0}^{9} (N/10 - n_i)^2/(N/10)$
        Find $p_i = (i + 1)N/10$, $i = 0(1)8$
        Find $\chi^2 = \left(\sum_{j=0}^{i} n_j - p_i\right)^2 / \{p_i(1 - p_i/N)\}$
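The program steps through the nine-degree-of-freedom chi-square can be sketched as follows. This is a minimal illustration for continuous elements only (no category), assuming $0 < P_F < 1$. The loop over a list replaces the explicit counter test of Step 5. The guard that places LCS = 1.0 in the last bin is an addition not stated in the steps. The cumulative chi-square of Step 7 is left out because it is only partly legible in the source. The function names are hypothetical.

```python
def lcs(p_f, p_v):
    """Step 3: likelihood that a chance score equals or exceeds the achieved score."""
    if p_f < p_v:
        if p_v >= p_f / (1.0 - p_f):
            return p_v
        return (p_v - p_f) / p_f
    if p_v >= 2.0 - 1.0 / p_f:
        return (p_f - p_v) / (1.0 - p_f)
    return 1.0 - p_v

def evaluate(pairs):
    """Steps 1 to 7 (first chi-square only) for (p_f, p_v) pairs."""
    n_bins = [0] * 10          # Step 1: n_i = 0, i = 0(1)9
    total = 0.0                # Step 1: running sum of LCS
    count = 0                  # Step 1: N = 0
    for p_f, p_v in pairs:     # Steps 2 and 5: take forecasts until exhausted
        value = lcs(p_f, p_v)  # Step 3
        total += value         # Step 4
        count += 1
        i = min(int(10.0 * value), 9)  # added guard: LCS of exactly 1.0 goes in bin 9
        n_bins[i] += 1
    mean_lcs = total / count   # Step 6
    e = 1.0 - 2.0 * mean_lcs
    expected = count / 10.0    # Step 7: expected count per tenth under chance
    chi2_9 = sum((expected - n) ** 2 / expected for n in n_bins)
    return mean_lcs, e, chi2_9, n_bins
```

For perfect forecasts ($P_F = P_V$) every LCS is zero and E = 1. For forecasts unrelated to the verifications the LCS values spread evenly over the ten bins, so E is near zero and $\chi_9^2$ is small.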